NSF PAR Search | NSF Public Access Repository


Efficient k-NN Search with Cross-Encoders using Adaptive Multi-Round CUR Decomposition

https://doi.org/10.18653/v1/2023.findings-emnlp.544

Yadav, Nishant; Monath, Nicholas; Zaheer, Manzil; McCallum, Andrew (December 2023, Findings of the Association for Computational Linguistics: EMNLP 2023)
Bouamor, Houda; Pino, Juan; Bali, Kalika (Ed.)
Cross-encoder models, which jointly encode and score a query-item pair, are prohibitively expensive for direct k-nearest neighbor (k-NN) search. Consequently, k-NN search typically employs a fast approximate retrieval (e.g. using BM25 or dual-encoder vectors), followed by reranking with a cross-encoder; however, the retrieval approximation often has detrimental recall regret. This problem is tackled by ANNCUR (Yadav et al., 2022), a recent work that employs a cross-encoder only, making search efficient using a relatively small number of anchor items, and a CUR matrix factorization. While ANNCUR’s one-time selection of anchors tends to approximate the cross-encoder distances on average, doing so forfeits the capacity to accurately estimate distances to items near the query, leading to regret in the crucial end-task: recall of top-k items. In this paper, we propose ADACUR, a method that adaptively, iteratively, and efficiently minimizes the approximation error for the practically important top-k neighbors. It does so by iteratively performing k-NN search using the anchors available so far, then adding these retrieved nearest neighbors to the anchor set for the next round. Empirically, on multiple datasets, in comparison to previous traditional and state-of-the-art methods such as ANNCUR and dual-encoder-based retrieve-and-rerank, our proposed approach ADACUR consistently reduces recall error—by up to 70% on the important k = 1 setting—while using no more compute than its competitors.
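The multi-round idea in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes higher scores mean closer items, picks first-round anchors uniformly at random, and returns the best exactly-scored anchors at the end. The names `adaptive_anchor_search`, `score_fn`, `per_round`, and `n_rounds` are hypothetical, and a synthetic low-rank score matrix stands in for a real cross-encoder.

```python
import numpy as np

def adaptive_anchor_search(score_fn, anchor_query_scores, k, n_rounds, per_round, seed=0):
    """Multi-round anchor selection for one test query (illustrative sketch).

    score_fn(item_id) -> exact (expensive) score of the test query vs one item.
    anchor_query_scores: (k_q, n_items) exact scores of anchor queries vs all items,
        assumed to be precomputed offline.
    """
    R = np.asarray(anchor_query_scores, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: the first round uses uniformly random anchor items.
    anchors = [int(i) for i in rng.choice(R.shape[1], size=per_round, replace=False)]
    exact = {i: score_fn(i) for i in anchors}
    for _ in range(n_rounds - 1):
        c = np.array([exact[i] for i in anchors])
        # CUR-style estimate of the query's scores against every item.
        # rcond drops near-zero singular values for numerical stability.
        approx = c @ np.linalg.pinv(R[:, anchors], rcond=1e-8) @ R
        approx[anchors] = -np.inf  # never re-select an existing anchor
        for i in np.argsort(-approx)[:per_round]:
            anchors.append(int(i))
            exact[int(i)] = score_fn(int(i))
    # Assumption: the final answer is the top-k anchors by exact score.
    return sorted(exact, key=exact.get, reverse=True)[:k]


# Synthetic stand-in for a cross-encoder: scores come from a rank-3 matrix.
rng = np.random.default_rng(1)
item_factors = rng.normal(size=(50, 3))
R_demo = rng.normal(size=(6, 3)) @ item_factors.T
true_scores = rng.normal(size=3) @ item_factors.T
top3 = adaptive_anchor_search(lambda i: float(true_scores[i]), R_demo, k=3, n_rounds=2, per_round=5)
```

The number of exact `score_fn` calls is `n_rounds * per_round`, so the rounds trade a fixed scoring budget between exploration and refinement near the query.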
Full Text Available
Efficient Nearest Neighbor Search for Cross-Encoder Models using Matrix Factorization

https://doi.org/10.18653/v1/2022.emnlp-main.140

Yadav, Nishant; Monath, Nicholas; Angell, Rico; Zaheer, Manzil; McCallum, Andrew (December 2022, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing)

Efficient k-nearest neighbor search is a fundamental task, foundational for many problems in NLP. When the similarity is measured by dot-product between dual-encoder vectors or L2-distance, there already exist many scalable and efficient search methods. But not so when similarity is measured by more accurate and expensive black-box neural similarity models, such as cross-encoders, which jointly encode the query and candidate neighbor. The cross-encoders’ high computational cost typically limits their use to reranking candidates retrieved by a cheaper model, such as dual encoder or TF-IDF. However, the accuracy of such a two-stage approach is upper-bounded by the recall of the initial candidate set, and potentially requires additional training to align the auxiliary retrieval model with the cross-encoder model. In this paper, we present an approach that avoids the use of a dual-encoder for retrieval, relying solely on the cross-encoder. Retrieval is made efficient with CUR decomposition, a matrix decomposition approach that approximates all pairwise cross-encoder distances from a small subset of rows and columns of the distance matrix. Indexing items using our approach is computationally cheaper than training an auxiliary dual-encoder model through distillation. Empirically, for k > 10, our approach provides test-time recall-vs-computational cost trade-offs superior to the current widely-used methods that re-rank items retrieved using a dual-encoder or TF-IDF.
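As a rough illustration of how a CUR-style factorization can estimate a full row of scores from a few rows and columns, consider the sketch below. It is a sketch under stated assumptions, not the paper's code: the function name and arguments are hypothetical, and the data are synthetic low-rank scores rather than cross-encoder outputs.

```python
import numpy as np

def cur_approximate_scores(anchor_query_scores, anchor_item_ids, test_query_anchor_scores):
    """Approximate one test query's scores against all items.

    anchor_query_scores: (k_q, n_items) exact scores, anchor queries vs all items.
    anchor_item_ids: indices of the k_i anchor items (the sampled columns).
    test_query_anchor_scores: (k_i,) exact scores, test query vs anchor items.
    """
    R = np.asarray(anchor_query_scores, dtype=float)
    # U is the pseudo-inverse of the block where sampled rows and columns intersect.
    # rcond drops near-zero singular values for numerical stability.
    U = np.linalg.pinv(R[:, anchor_item_ids], rcond=1e-8)      # shape (k_i, k_q)
    return np.asarray(test_query_anchor_scores, dtype=float) @ U @ R  # shape (n_items,)


# Synthetic demo: when the score matrix has rank 4, a handful of anchors suffices.
rng = np.random.default_rng(0)
item_factors = rng.normal(size=(200, 4))
R_demo = rng.normal(size=(8, 4)) @ item_factors.T       # 8 anchor queries
true_scores = rng.normal(size=4) @ item_factors.T       # unseen test query
anchor_items = rng.choice(200, size=10, replace=False)  # 10 anchor items
approx = cur_approximate_scores(R_demo, anchor_items, true_scores[anchor_items])
max_error = float(np.max(np.abs(approx - true_scores)))
```

Only the 10 anchor-item scores require calls to the expensive model at query time; the remaining 190 scores are estimated with two small matrix products. Real score matrices are only approximately low rank, so the estimate would carry error there.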
Full Text Available
Interactive Correlation Clustering with Existential Cluster Constraints

Angell, Rico; Monath, Nicholas; Yadav, Nishant; McCallum, Andrew (July 2022, Proceedings of the 39th International Conference on Machine Learning)

We consider the problem of clustering with user feedback. Existing methods express constraints about the input data points, most commonly through must-link and cannot-link constraints on data point pairs. In this paper, we introduce existential cluster constraints: a new form of feedback where users indicate the features of desired clusters. Specifically, users make statements about the existence of a cluster having (and not having) particular features. Our approach has multiple advantages: (1) constraints on clusters can express user intent more efficiently than point pairs; (2) in cases where the users’ mental model is of the desired clusters, it is more natural for users to express cluster-wise preferences; (3) it functions even when privacy restrictions prohibit users from seeing raw data. In addition to introducing existential cluster constraints, we provide an inference algorithm for incorporating our constraints into the output clustering. Finally, we demonstrate empirically that our proposed framework facilitates more accurate clustering with dramatically fewer user feedback inputs.
Full Text Available
Event and Entity Coreference using Trees to Encode Uncertainty in Joint Decisions

https://doi.org/10.18653/v1/2021.crac-1.11

Yadav, Nishant; Monath, Nicholas; Angell, Rico; McCallum, Andrew (November 2021, Proceedings of the 4th Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC 2021))

Coreference decisions among event mentions and among co-occurring entity mentions are highly interdependent, thus motivating joint inference. Capturing the uncertainty over each variable can be crucial for inference among multiple dependent variables. Previous work on joint coreference employs heuristic approaches that lack well-defined objectives and do not model uncertainty on each side of the joint problem. We present a new approach to joint coreference, including (1) a formal cost function inspired by Dasgupta’s cost for hierarchical clustering, and (2) a representation for uncertainty of clustering of event and entity mentions, again based on a hierarchical structure. We describe an alternating optimization method for inference that, when clustering event mentions, considers the uncertainty of the clustering of entity mentions, and vice versa. We show that our proposed joint model provides empirical advantages over state-of-the-art independent and joint models.
Full Text Available
Resilience of Urban Transport Network-of-Networks under Intense Flood Hazards Exacerbated by Targeted Attacks

https://doi.org/10.1038/s41598-020-66049-y

Yadav, Nishant; Chatterjee, Samrat; Ganguly, Auroop R. (December 2020, Scientific Reports)

Full Text Available
A Deep Learning Approach to Short-Term Quantitative Precipitation Forecasting

https://doi.org/10.1145/3429309.3429311

Yadav, Nishant; Ganguly, Auroop R. (September 2020, Proceedings of the 10th International Conference on Climate Informatics)
Full Text Available
Clustering-based Inference for Zero-Shot Biomedical Entity Linking

Angell, Rico; Monath, Nicholas; Mohan, Sunil; Yadav, Nishant; McCallum, Andrew (January 2021, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies)
Due to the large number of entities in biomedical knowledge bases, only a small fraction of entities have corresponding labelled training data. This necessitates entity linking models which are able to link mentions of unseen entities using learned representations of entities. Previous approaches link each mention independently, ignoring the relationships within and across documents between the entity mentions. These relations can be very useful for linking mentions in biomedical text where linking decisions are often difficult due to mentions having a generic or a highly specialized form. In this paper, we introduce a model in which linking decisions can be made not merely by linking to a knowledge base entity but also by grouping multiple mentions together via clustering and jointly making linking predictions. In experiments on the largest publicly available biomedical dataset, we improve the best independent prediction for entity linking by 3.0 points of accuracy, and our clustering-based inference model further improves entity linking by 2.3 points.
Full Text Available
Machine Learning for Robust Identification of Complex Nonlinear Dynamical Systems: Applications to Earth Systems Modeling

Yadav, Nishant; Ravela, Sai; Ganguly, Auroop R. (August 2020, ArXivorg)
Systems exhibiting nonlinear dynamics, including but not limited to chaos, are ubiquitous across Earth Sciences such as Meteorology, Hydrology, Climate and Ecology, as well as Biology such as neural and cardiac processes. However, System Identification remains a challenge. In climate and earth systems models, while governing equations follow from first principles and understanding of key processes has steadily improved, the largest uncertainties are often caused by parameterizations such as cloud physics, which in turn have witnessed limited improvements over the last several decades. Climate scientists have pointed to Machine Learning enhanced parameter estimation as a possible solution, with proof-of-concept methodological adaptations being examined on idealized systems. While climate science has been highlighted as a "Big Data" challenge owing to the volume and complexity of archived model-simulations and observations from remote and in-situ sensors, the parameter estimation process is often a relatively "small data" problem. A crucial question for data scientists in this context is the relevance of state-of-the-art data-driven approaches including those based on deep neural networks or kernel-based processes. Here we consider a chaotic system, the two-level Lorenz-96, used as a benchmark model in the climate science literature, adopt a methodology based on Gaussian Processes for parameter estimation, and compare the gains in predictive understanding with a suite of Deep Learning and strawman Linear Regression methods. Our results show that adaptations of kernel-based Gaussian Processes can outperform other approaches under small data constraints while also providing uncertainty quantification, and need to be considered as a viable approach in climate science and earth system modeling.
Full Text Available
Deep Transfer Learning on Satellite Imagery Improves Air Quality Estimates in Developing Nations

https://doi.org/10.48550/arXiv.2202.08890

Yadav, Nishant; Sorek-Hamer, Meytar; Von Pohle, Michael; Asanjan, Ata Akbari; Sahasrabhojanee, Adwait; Suel, Esra; Arku, Raphael; Lingenfelter, Violet; Brauer, Michael; Ezzati, Majid; et al (February 2022, ArXivorg)

Urban air pollution is a public health challenge in low- and middle-income countries (LMICs). However, LMICs lack adequate air quality (AQ) monitoring infrastructure. A persistent challenge has been our inability to estimate AQ accurately in LMIC cities, which hinders emergency preparedness and risk mitigation. Deep learning-based models that map satellite imagery to AQ can be built for high-income countries (HICs) with adequate ground data. Here we demonstrate that a scalable approach that adapts deep transfer learning on satellite imagery for AQ can extract meaningful estimates and insights in LMIC cities based on spatiotemporal patterns learned in HIC cities. The approach is demonstrated for Accra in Ghana, Africa, with AQ patterns learned from two US cities, specifically Los Angeles and New York.
Full Text Available
SUBSUME: A Dataset for Subjective Summary Extraction from Wikipedia Documents

https://doi.org/10.18653/v1/2021.newsum-1.14

Yadav, Nishant; Brucato, Matteo; Fariha, Anna; Youngquist, Oscar; Killingback, Julian; Meliou, Alexandra; Haas, Peter (January 2021, New Frontiers in Summarization workshop (at EMNLP 2021))

Full Text Available
